Abstract

The paper presents a language model that 
develops syntactic structure and uses it to 
extract meaningful information from the 
word history, thus enabling the use of 
long distance dependencies. The model assigns
probability to every joint sequence
of words-binary-parse-structure with headword
annotation. The model, its probabilistic
parametrization, and a set of experiments
meant to evaluate its predictive
power are presented.

1 Introduction 

The main goal of the proposed project is to develop 
a language model (LM) that uses syntactic structure.
The principles that guided this proposal were: 

• the model will develop syntactic knowledge as a 
built-in feature; it will assign a probability to every 
joint sequence of words-binary-parse-structure; 

• the model should operate in a left-to-right manner
so that it would be possible to decode word lattices
provided by an automatic speech recognizer.

The model consists of two modules: a next-word
predictor and a parser. The predictor makes use of the
syntactic structure developed by the parser, and the
operations of the two modules are intertwined.

2 The Basic Idea and Terminology 

Consider predicting the word barked in the sentence:

the dog I heard yesterday barked again. 

A 3-gram approach would predict barked from
(heard, yesterday), whereas it is clear that the
predictor should use the word dog, which is outside
the reach of even 4-grams. Our assumption
is that what enables us to make a good prediction
of barked is the syntactic structure in the
past. The correct partial parse of the word history
when predicting barked is shown in Figure 1.
The word dog is called the headword of the constituent
(the (dog (...))), and dog is an exposed
headword when predicting barked: it is the topmost
headword in the largest constituent that contains it. The
syntactic structure in the past filters out irrelevant
words and points to the important ones, thus enabling
the use of long distance information when
predicting the next word.

the dog I heard yesterday barked
Figure 1: Partial parse

Our model will assign a
probability P(W,T) to every sentence W with every
possible binary branching parse T and every
possible headword annotation for every constituent
of T. Let W be a sentence of length l words to
which we have prepended <s> and appended </s>
so that w_0 = <s> and w_{l+1} = </s>. Let W_k be the
word k-prefix w_0 ... w_k of the sentence and W_k T_k
the word-parse k-prefix. To stress this point, a
word-parse k-prefix contains only those binary trees
whose span is completely included in the word
k-prefix, excluding w_0 = <s>. Single words can be
regarded as root-only trees. Figure 2 shows a
word-parse k-prefix; h_0 ... h_{-m} are the exposed
headwords. A complete parse (Figure 3) is any binary
parse of the w_1 ... w_l </s> sequence with the
restriction that </s> is the only allowed headword.

<s> w_1 ... w_p ... w_q ... w_r w_{r+1} ... w_k w_{k+1} ... w_n </s>

Figure 2: A word-parse k-prefix








<s> w_1 ... w_l </s>


Figure 3: Complete parse 

Note that (w_1 ... w_l) needn't be a constituent, but
for the parses where it is, there is no restriction on
which of its words is the headword.

The model will operate by means of two modules: 

• PREDICTOR predicts the next word w_{k+1} given
the word-parse k-prefix and then passes control to
the PARSER;

• PARSER grows the already existing binary 
branching structure by repeatedly generating the 
transitions adjoin-left or adjoin-right until it 
passes control to the PREDICTOR by taking a null 
transition. 

The operations performed by the PARSER ensure
that all possible binary branching parses with
all possible headword assignments for the w_1 ... w_k
word sequence can be generated. They are illustrated
by Figures 4-6. The following algorithm describes
how the model generates a word sequence
with a complete parse (see Figures 3-6 for notation):

Transition t;                          // a PARSER transition
generate <s>;
do {
    predict next_word;                 // PREDICTOR
    do {                               // PARSER
        if (T_{-1} != <s>)
            if (h_0 == </s>) t = adjoin-right;
            else t = {adjoin-{left,right}, null};
        else t = null;
    } while (t != null)
} while (!(h_0 == </s> && T_{-1} == <s>))
t = adjoin-right;                      // adjoin <s>; DONE

It is easy to see that any given word sequence with a 
possible parse and headword annotation is generated 
by a unique sequence of model actions. 
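
The generation procedure above can be sketched in Python by replaying a
given sequence of model actions. This is only an illustrative sketch, not
the implementation used in this work: the names Tree, allowed, adjoin, run
and complete are hypothetical, the parse state is assumed to be a stack of
trees whose bottom element is <s>, and "T_{-1} == <s>" is assumed to be
equivalent to the stack holding exactly two trees. The example replays the
partial parse of Figure 1.

```python
from dataclasses import dataclass
from typing import List, Optional, Sequence, Tuple

SB, SE = "<s>", "</s>"  # sentence-begin and sentence-end markers


@dataclass
class Tree:
    head: str                       # headword of the constituent
    left: Optional["Tree"] = None   # None for root-only (single word) trees
    right: Optional["Tree"] = None


def allowed(stack: List[Tree]) -> List[str]:
    """PARSER transitions permitted in the current state."""
    if len(stack) == 2:             # T_{-1} == <s>: only null is allowed
        return ["null"]
    if stack[-1].head == SE:        # h_0 == </s>: adjoin-right is forced
        return ["adjoin-right"]
    return ["adjoin-left", "adjoin-right", "null"]


def adjoin(stack: List[Tree], t: str) -> None:
    """Merge the two topmost trees; adjoin-left keeps h_{-1}, adjoin-right keeps h_0."""
    right, left = stack.pop(), stack.pop()
    head = left.head if t == "adjoin-left" else right.head
    stack.append(Tree(head, left, right))


Step = Tuple[str, Sequence[str]]    # (predicted word, PARSER transitions ending in null)


def run(steps: Sequence[Step]) -> List[Tree]:
    """Replay PREDICTOR/PARSER actions and return the stack of exposed trees."""
    stack = [Tree(SB)]
    for word, transitions in steps:
        stack.append(Tree(word))    # PREDICTOR
        for t in transitions:       # PARSER
            if t not in allowed(stack):
                raise ValueError(f"{t} is not allowed after {word!r}")
            if t == "null":
                break
            adjoin(stack, t)
    return stack


def complete(stack: List[Tree]) -> Tree:
    """Final step of the algorithm: adjoin <s> once h_0 == </s> and T_{-1} == <s>."""
    if not (len(stack) == 2 and stack[-1].head == SE):
        raise ValueError("parse is not complete")
    adjoin(stack, "adjoin-right")
    return stack[0]


PREFIX: List[Step] = [
    ("the", ["null"]),
    ("dog", ["null"]),
    ("I", ["null"]),
    ("heard", ["adjoin-right", "null"]),
    ("yesterday", ["adjoin-left", "adjoin-left", "adjoin-right", "null"]),
]
REST: List[Step] = [
    ("barked", ["adjoin-right", "null"]),
    ("again", ["adjoin-left", "null"]),
    (SE, ["adjoin-right", "null"]),
]

exposed = [t.head for t in run(PREFIX)]   # exposed headwords when predicting "barked"
root = complete(run(PREFIX + REST))       # complete parse, headed by </s>
```

With these actions the exposed headwords before predicting barked are <s>
and dog, matching the example in Section 2.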

3 Probabilistic Model 

The probability P(W,T) can be broken into: 

P(W,T) = \prod_{k=1}^{l+1} [ P(w_k / W_{k-1}T_{k-1}) \cdot
         \prod_{i=1}^{N_k} P(t_i^k / w_k, W_{k-1}T_{k-1}, t_1^k ... t_{i-1}^k) ]

where:

• W_{k-1}T_{k-1} is the word-parse (k-1)-prefix

• w_k is the word predicted by PREDICTOR

• N_k - 1 is the number of adjoin operations the
PARSER executes before passing control to the
PREDICTOR (the N_k-th operation at position k is
the null transition); N_k is a function of T


h_{-2}   h_{-1}   h_0

Figure 4: Before an adjoin operation

h'_{-1} = h_{-2}   h'_0 = h_{-1}

Figure 5: Result of adjoin-left

h'_{-1} = h_{-2}   h'_0 = h_0

Figure 6: Result of adjoin-right

• t_i^k denotes the i-th PARSER operation carried
out at position k in the word string;

t_i^k ∈ {adjoin-left, adjoin-right}, i < N_k,

t_i^k = null, i = N_k

Our model is based on two probabilities:

P(w_k / W_{k-1}T_{k-1})    (1)

P(t_i^k / w_k, W_{k-1}T_{k-1}, t_1^k ... t_{i-1}^k)    (2)

As can be seen, (w_k, W_{k-1}T_{k-1}, t_1^k ... t_i^k) is one
of the N_k word-parse k-prefixes of W_k T_k, i = 1, ..., N_k,
at position k in the sentence.
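
As an illustration of how the factors (1) and (2) combine into P(W,T), the
following Python sketch accumulates the log-probability of one derivation.
It is a sketch under stated assumptions, not the parametrization used in
this work: log_prob, uniform_word, uniform_trans and VOCAB_SIZE are
hypothetical, both component models are toy uniform distributions, and the
conditioning context is reduced to the list of exposed headwords.

```python
import math
from typing import Callable, List, Sequence, Tuple

SB, SE = "<s>", "</s>"
Step = Tuple[str, Sequence[str]]   # (word w_k, transitions t_1^k ... t_{N_k}^k)


def allowed(heads: List[str]) -> List[str]:
    """PARSER transitions permitted given the exposed headwords."""
    if len(heads) == 2:            # T_{-1} == <s>
        return ["null"]
    if heads[-1] == SE:            # h_0 == </s>
        return ["adjoin-right"]
    return ["adjoin-left", "adjoin-right", "null"]


def log_prob(
    steps: Sequence[Step],
    p_word: Callable[[str, List[str]], float],
    p_trans: Callable[[str, List[str]], float],
) -> float:
    """log P(W,T) as a sum of log factor (1) and log factor (2) terms."""
    heads = [SB]
    total = 0.0
    for word, transitions in steps:
        total += math.log(p_word(word, heads))     # factor (1)
        heads.append(word)
        for t in transitions:
            total += math.log(p_trans(t, heads))   # factor (2); log(0) raises ValueError
            if t == "null":
                break
            right, left = heads.pop(), heads.pop()
            heads.append(left if t == "adjoin-left" else right)
    return total


VOCAB_SIZE = 10                    # hypothetical vocabulary size, including </s>


def uniform_word(word: str, heads: List[str]) -> float:
    """Toy PREDICTOR: uniform over the vocabulary."""
    return 1.0 / VOCAB_SIZE


def uniform_trans(t: str, heads: List[str]) -> float:
    """Toy PARSER: uniform over the allowed transitions, zero otherwise."""
    options = allowed(heads)
    return 1.0 / len(options) if t in options else 0.0


derivation: List[Step] = [
    ("the", ["null"]),
    ("dog", ["adjoin-right", "null"]),
    ("barked", ["adjoin-right", "null"]),
    (SE, ["adjoin-right", "null"]),
]
lp = log_prob(derivation, uniform_word, uniform_trans)
```

In this toy setting only two transitions are free choices (each with
probability 1/3); all others are forced, so P(W,T) = 10^{-4} · (1/3)^2. The
final adjoin-right that attaches <s> is deterministic and contributes no
factor.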

To ensure a proper probabilistic model we have
to make sure that (1) and (2) are well defined
conditional probabilities and that the model halts with
probability one. A few provisions need to be taken:

• P(null / W_k T_k) = 1, if T_{-1} == <s>, ensures
that <s> is adjoined in the last step of the parsing
process;

• P(adjoin-right / W_k T_k) = 1, if h_0 == </s>,
ensures that the headword of a complete parse is
</s>;

• ∃ε > 0 s.t. P(w_k = </s> / W_{k-1}T_{k-1}) > ε, ∀ W_{k-1}T_{k-1},
ensures that the model halts with probability one.
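
The halting argument behind the last provision can be spelled out as
follows. This is a short sketch of the standard reasoning, not a derivation
taken from this work. Since every PREDICTOR step emits </s> with probability
greater than ε regardless of the word-parse prefix, the probability that
none of the first n predicted words is </s> is bounded by a geometric term:

```latex
P\bigl(w_1 \neq \texttt{</s>}, \ldots, w_n \neq \texttt{</s>}\bigr)
  = \prod_{k=1}^{n}
    P\bigl(w_k \neq \texttt{</s>} \,/\, w_1 \neq \texttt{</s>}, \ldots, w_{k-1} \neq \texttt{</s>}\bigr)
  \le (1 - \epsilon)^{n}
  \xrightarrow[\; n \to \infty \;]{} 0
```

The PARSER itself always terminates after each predicted word, because
every adjoin operation reduces the number of exposed trees by one.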

3.1 The first model 

The first term (1) can be reduced to an n-gram LM,
P(w_k / W_{k-1}T_{k-1}) = P(w_k / w_{k-1} ... w_{k-n+1}).

A simple alternative to this degenerate approach
would be to build a model which predicts the next
word based on the preceding p-1 exposed headwords
and n-1 words in the history, thus making the following
equivalence classification:

[W_k T_k] = {h_0 ... h_{-p+2}, w_{k-1} ... w_{k-n+1}}.

















The approach is similar to the trigger LM (Lau93),
the difference being that in the present work triggers 
are identified using the syntactic structure. 
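
The equivalence classification of Section 3.1 can be illustrated with a
small Python function. It is a hypothetical sketch: the name
equivalence_class and the list-based representation of the history (exposed
headwords and words, both ordered from oldest to most recent) are
assumptions made for this example only.

```python
from typing import List, Tuple

Context = Tuple[Tuple[str, ...], Tuple[str, ...]]


def equivalence_class(heads: List[str], words: List[str], p: int, n: int) -> Context:
    """Map a word-parse prefix to (h_0 ... h_{-p+2}, w_{k-1} ... w_{k-n+1}).

    heads: exposed headwords, oldest first (h_{-m} ... h_0)
    words: word history, oldest first (w_0 ... w_{k-1})
    """
    # Keep the p-1 most recent exposed headwords, most recent first.
    head_ctx = tuple(reversed(heads[-(p - 1):])) if p > 1 else ()
    # Keep the n-1 most recent words, most recent first.
    word_ctx = tuple(reversed(words[-(n - 1):])) if n > 1 else ()
    return head_ctx, word_ctx


# History when predicting "barked" in the example of Section 2.
heads = ["<s>", "dog"]
words = ["<s>", "the", "dog", "I", "heard", "yesterday"]
ctx = equivalence_class(heads, words, p=3, n=3)
```

For p = 3 and n = 3 the context combines the two exposed headwords (dog,
<s>) with the two preceding words (yesterday, heard), so the long distance
word dog is available to the predictor alongside the usual 3-gram history.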

3.2 The second model 

Model (2) assigns probability to different binary
parses of the word k-prefix by chaining the elementary
operations described above. The workings
of the PARSER are very similar to those of Spatter
(Jelinek94). It can be brought to the full power
of Spatter by changing the action of the adjoin
operation so that it takes into account the
terminal/nonterminal labels of the constituent proposed
by adjoin and it also predicts the nonterminal label
of the newly created constituent; PREDICTOR
will now predict the next word along with its POS
tag. The best equivalence classification of the W_k T_k
word-parse k-prefix is yet to be determined. The
Collins parser (Collins96) shows that
dependency-grammar-like bigram constraints may be the most
adequate, so the equivalence classification [W_k T_k]
should contain at least {h_0,


4 Preliminary Experiments 

Assuming that the correct partial parse is a function
of the word prefix, it makes sense to compare
the word level perplexity (PP) of a standard n-gram
LM with that of the P(w_k / W_{k-1}T_{k-1}) model. We
developed and evaluated four LMs:

• 2 bigram LMs P(w_k / W_{k-1}T_{k-1}) = P(w_k / w_{k-1}),
referred to as W and w, respectively; w_{k-1} is the
previous (word, POStag) pair;

• 2 P(w_k / W_{k-1}T_{k-1}) = P(w_k / h_0) models,
referred to as H and h, respectively; h_0 is the previous
exposed (headword, POS/non-term tag) pair; the
parses used in this model were those assigned manually
in the Penn Treebank (Marcus95) after undergoing
headword percolation and binarization.

All four LMs predict a word w_k and they were
implemented using the Maximum Entropy Modeling
Toolkit¹ (Ristad97). The constraint templates
in the {W,H} models were:

4 <= <*>_<*> <?>; 2 <= <?>_<*> <?>; 

2 <= <?>_<?> <?>; 8 <= <*>_<?> <?>; 
and in the {w,h} models they were: 

4 <= <*>_<*> <?>; 2 <= <?>_<*> <?>; 

<*> denotes a don't care position, <?>_<?> a (word,
tag) pair; for example, 4 <= <?>_<*> <?> will trigger
on all ((word, any tag), predicted-word) pairs
that occur more than 3 times in the training data.
The sentence boundary is not included in the PP
calculation. Table 1 shows the PP results along with
the number of parameters for each of the 4 models
described.

¹ ftp://ftp.cs.princeton.edu/pub/packages/memt
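
The count thresholds in the constraint templates can be illustrated with a
short Python sketch. It does not use or reproduce the toolkit's interface:
active_features and the event format are hypothetical, and the tags in the
example are made up for illustration. The sketch only shows how a template
such as 4 <= <?>_<*> <?> selects the events seen at least 4 times once the
tag position is treated as a don't care.

```python
from collections import Counter
from typing import Iterable, Set, Tuple

Event = Tuple[str, str, str]   # (context word, context tag, predicted word)


def active_features(
    events: Iterable[Event], min_count: int, use_word: bool, use_tag: bool
) -> Set[Event]:
    """Return the features of one template whose training count reaches min_count.

    A position with use_word/use_tag set to False is a don't care and is
    replaced by "*" before counting.
    """
    counts = Counter(
        (w if use_word else "*", t if use_tag else "*", y)
        for w, t, y in events
    )
    return {feature for feature, c in counts.items() if c >= min_count}


# Toy training events (hypothetical data).
events = (
    [("dog", "NN", "barked")] * 3
    + [("dog", "NNP", "barked")]
    + [("cat", "NN", "meowed")] * 2
)

# Template 4 <= <?>_<*> <?>: (word, any tag) followed by the predicted word.
word_only = active_features(events, min_count=4, use_word=True, use_tag=False)

# Template 2 <= <?>_<?> <?>: full (word, tag) pair followed by the predicted word.
word_tag = active_features(events, min_count=2, use_word=True, use_tag=True)
```

Ignoring the tag pools the counts of (dog, NN) and (dog, NNP), which is why
the word-only template reaches its higher threshold here.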


LM   PP    param     LM   PP    param
W    352   208487    w    419   103732
H    292   206540    h    410   102437


Table 1: Perplexity results 
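
For reference, word level perplexity with the sentence boundary excluded can
be computed as in the sketch below. The function names are hypothetical, and
treating "</s>" as the excluded boundary token is an assumption, since the
text does not say which boundary symbol is meant. The second helper simply
expresses the relative difference between two perplexities, applied here to
the W and H entries of Table 1.

```python
import math
from typing import Iterable, Tuple


def perplexity(word_probs: Iterable[Tuple[str, float]], boundary: str = "</s>") -> float:
    """Perplexity over (word, model probability) pairs, skipping the boundary token."""
    log_sum, count = 0.0, 0
    for word, prob in word_probs:
        if word == boundary:
            continue                # sentence boundary is not scored
        log_sum += math.log(prob)
        count += 1
    return math.exp(-log_sum / count)


def relative_reduction(baseline_pp: float, new_pp: float) -> float:
    """Fraction by which new_pp is lower than baseline_pp."""
    return (baseline_pp - new_pp) / baseline_pp


# Toy example: two scored words, boundary prediction ignored.
toy_pp = perplexity([("the", 0.25), ("dog", 0.25), ("</s>", 0.5)])

# Table 1: W (bigram) versus H (exposed headword) models.
w_vs_h = relative_reduction(352, 292)
```

On the figures of Table 1, the H model's perplexity is about 17% lower than
that of the W model.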